16 research outputs found

    New approaches for content-based analysis towards online social network spam detection

    Unsolicited email campaigns remain one of the biggest threats, affecting millions of users per day. Although spam filtering techniques can detect a significant percentage of spam messages, the problem is far from solved, especially because of the total amount of spam traffic that flows over the Internet and the new attack vectors used by malicious users. The deeply entrenched use of Online Social Networks (OSNs), where millions of users unwittingly share all kinds of personal data, offers attackers a very attractive channel. These sites offer two main areas of interest for malicious activities: exploitation of the huge amount of information stored in user profiles, and the possibility of targeting user addresses and user spaces through personal profiles, groups, pages, and so on. Consequently, new types of targeted attacks are being detected on these channels. Because the main purposes of spam messages are selling products, creating social alarm, running public awareness campaigns, generating traffic with viral content, fooling users with suspicious attachments, and the like, these communications have a specific writing style that spam filtering can take advantage of. The main objectives of this thesis are: (i) to demonstrate that it is possible to develop new targeted attacks that exploit personalized spam campaigns using OSN information, and (ii) to design and validate novel spam detection methods that help detect the intentionality of messages, using natural language processing techniques, in order to classify them as spam or legitimate. Additionally, those methods must also be effective against the spam that is appearing in OSNs. To achieve the first objective, a system to design and send personalized spam campaigns is proposed. We automatically extract users’ public information from a well-known social site. We analyze it and design different templates, taking into account the preferences of the users. 
After that, different experiments are carried out sending typical and personalized spam. The results show that the click-through rate improves considerably with this new strategy. In the second part of the thesis we propose three novel spam filtering methods. These methods aim to detect non-evident illegitimate intent in order to add valid information that spam classifiers can use. To detect the intentionality of texts, we hypothesize that sentiment analysis and personality recognition techniques could provide new means of differentiating spam text from legitimate text. Based on this assumption, we present three methods: the first uses sentiment analysis to extract the polarity feature of each analyzed text, so that we can compare the optimistic or pessimistic attitude of spam messages with that of legitimate texts; the second uses personality recognition techniques to add personality dimensions (Extroversion/Introversion, Thinking/Feeling, Judging/Perceiving and Sensing/iNtuition) to the spam filtering process; and the last is a combination of the two. Once the methods are described, we experimentally validate the proposed approaches on three different types of spam: email spam, SMS spam and spam from a popular OSN.
Abstract (Basque version): Messages sent without the recipient's permission (spam) are a threat that affects millions of users per day. Although spam detection tools achieve increasingly better results, the problem is still very far from solved, mainly because of the volume of spam and the attackers' new strategies. Moreover, as a result of the growth in users that social networks have seen in recent years, with millions of users making their private data public, these sites have become very attractive places for attackers. These websites offer two main areas of interest: exploitation of all the information accumulated in profiles, and the ease of direct contact with users (through profiles, groups, pages, and so on). Consequently, more and more illegal activity is being detected on these websites. Since the main purposes of spam messages are to sell something, create social alarm, launch awareness campaigns, and the like, the writing style these messages usually have can be used to detect them. The main objectives of this work are, on the one hand, to show that it is possible to develop personalized spam that evades current detection systems by using public information from social networks, and, on the other, to develop new methodologies that use natural language processing techniques to capture the intentionality of texts and detect spam. In addition, these systems must also be able to work with spam messages from social networks. To achieve the first objective, this work presents a system to design and send personalized spam. We automatically extract users' public information from a well-known social network, then analyze that information and develop different templates taking the users' preferences into account. Once that is done, we carry out several experiments sending typical and personalized spam, in order to compare the results of the two. In the second part of the thesis we present three new spam detection methodologies. Their aim is to detect non-trivial commercial intent and use it to classify spam messages. To capture that intentionality, we use sentiment analysis and personality detection techniques. Three different systems are therefore presented here: the first uses sentiment analysis alone, the second examines what personality detection techniques offer for this task, and the last is a combination of the two. Using these tools, an experimental validation is carried out to examine whether the proposed systems are effective, working with three different types of spam: email spam, SMS spam and spam from a popular social network.

    Nuevos Paradigmas de Análisis Basados en Contenidos para la Detección del Spam en RRSS [New Content-Based Analysis Paradigms for Spam Detection in Social Networks]

    Doctoral thesis by Enaitz Ezpeleta Gallastegi at Mondragon Unibertsitatea, within the Intelligent Systems for Industrial Systems group, supervised by Drs. Urko Zurutuza Ortega (Mondragon Unibertsitatea) and José María Gómez Hidalgo (Pragsis Technologies). The defense took place on 30 September 2016 in Arrasate. The examination committee comprised Dr. Manel Medina Llinas (Universitat Politecnica de Catalunya), Dr. Magnus Almgren (Chalmers University of Technology), Dr. Igor Santos Grueiro (Universidad de Deusto), Dr. José Ramón Méndez Reboredo (Universidad de Vigo) and Dr. Iñaki Garitano Garitano (Mondragon Unibertsitatea). The thesis was awarded the grade of Sobresaliente Cum Laude and the "Doctor Europeus" mention.

    Short Messages Spam Filtering Using Sentiment Analysis

    As short instant messages become more widely used, spam and other illegitimate campaigns over this type of communication system are also growing. Those campaigns, besides being an illegal online activity, are a direct threat to the privacy of users. Previous short-message spam filtering techniques focus on automatic text classification and do not take message polarity into account. Focusing on phone SMS messages, this work demonstrates that it is possible to improve spam filtering in short message services using sentiment analysis techniques. Using a publicly available labelled (spam/legitimate) SMS dataset, we calculate the polarity of each message and add the polarity score to the original dataset, creating new datasets. We compare the results of the best classifiers and filters over the different datasets (with and without polarity) in order to demonstrate the influence of the polarity. Experiments show that the polarity score improves SMS spam classification, reaching 98.91% accuracy on the one hand and, on the other, yielding 0 false positives at 98.67% accuracy.
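The feature-enrichment step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny word lists, the `polarity` and `add_polarity` functions and the sample messages are hypothetical placeholders, not the sentiment analyzer or the SMS dataset used in the paper.

```python
# Hypothetical mini-lexicon; a real experiment would use a full sentiment analyzer.
POSITIVE = {"free", "win", "winner", "great", "congratulations", "love"}
NEGATIVE = {"sorry", "late", "bad", "problem", "miss"}


def polarity(text):
    """Return a score in [-1, 1]: (positive - negative) / matched lexicon words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def add_polarity(dataset):
    """Copy each (label, text) row and append its polarity as a new column."""
    return [(label, text, polarity(text)) for label, text in dataset]


# Made-up labelled messages standing in for a real spam/legitimate corpus.
messages = [
    ("spam", "Congratulations! You win a FREE prize"),
    ("ham", "Sorry, I will be late"),
    ("ham", "See you at noon"),
]
enriched = add_polarity(messages)
```

A classifier would then be trained twice, once on `messages` and once on `enriched`, so that any difference in accuracy can be attributed to the extra polarity column.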

    Uso de Técnicas de Reconocimiento de la Personalidad para Mejorar el Filtrado Bayesiano de Spam [Using Personality Recognition Techniques to Improve Bayesian Spam Filtering]

    Millions of users per day are affected by unsolicited email campaigns. In recent years several techniques to detect spam have been developed, achieving especially good results with machine learning algorithms. In this work we provide a baseline for a new spam filtering method. In carrying out this research we validate our hypothesis that personality recognition techniques can help in Bayesian spam filtering. We add the personality feature to each email using personality recognition techniques, and then we compare Bayesian spam filters with and without personality in terms of accuracy. In a second experiment we combine the personality and polarity features of each message and compare all the results. In the end, the top ten Bayesian filtering classifiers are improved, reaching 99.24% accuracy and also reducing the number of false positives.
Abstract (Spanish version): Millions of users are affected every day by unsolicited email campaigns. In recent years researchers have developed various spam detection techniques, obtaining especially good results with machine learning algorithms. In this work we present a baseline for a new spam filtering method. During the study we validated the hypothesis that personality recognition techniques can help improve Bayesian spam filtering. Using these techniques, we add the personality feature to each email and then compare the results of Bayesian spam filtering with and without personality, analyzing the results in terms of accuracy. In a second experiment, we combine the personality and polarity features of each message and compare the results. In the end, we improve the results of Bayesian spam filtering, reaching 99.24% accuracy and reducing the number of false positives.
This work has been partially funded by the Basque Department of Education, Language Policy and Culture under the project SocialSPAM (PI_2014_1_102).

    Validation of Random Forest Machine Learning Models to Predict Dementia-Related Neuropsychiatric Symptoms in Real-World Data

    Background: Neuropsychiatric symptoms (NPS) are the leading cause of the social burden of dementia but their role is underestimated. Objective: The objective of the study was to validate predictive models to separately identify psychotic and depressive symptoms in patients diagnosed with dementia using clinical databases representing the whole population to inform decision-makers. Methods: First, we searched the electronic health records of 4,003 patients with dementia to identify NPS. Second, machine learning (random forest) algorithms were applied to build separate predictive models for psychotic and depressive symptom clusters in the training set (N = 3,003). Third, calibration and discrimination were assessed in the test set (N = 1,000) to assess the performance of the models. Results: Neuropsychiatric symptoms were noted in the electronic health record of 58% of patients. The area under the receiver operating characteristic curve reached 0.80 for the psychotic cluster model and 0.74 for the depressive cluster model. The Kappa index and accuracy also showed better discrimination in the psychotic model. Calibration plots indicated that both types of model had less predictive accuracy when the probability of neuropsychiatric symptoms was <25%. The most important variables in the psychotic cluster model were use of risperidone, level of sedation, use of quetiapine and haloperidol and the number of antipsychotics prescribed. In the depressive cluster model, the most important variables were number of antidepressants prescribed, escitalopram use, level of sedation, and age. Conclusion: Given their relatively good performance, the predictive models can be used to estimate prevalence of NPS in population databases.
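The discrimination measures named in this abstract (area under the ROC curve and the Kappa index) can be computed as in the sketch below. It is a minimal plain-Python illustration: the labels, scores and the 0.5 decision threshold are made-up assumptions, not data or settings from the study, and no random forest is trained here.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def cohen_kappa(labels, predictions):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = len(labels)
    observed = sum(y == p for y, p in zip(labels, predictions)) / n
    expected = sum(
        (labels.count(c) / n) * (predictions.count(c) / n) for c in (0, 1)
    )
    return (observed - expected) / (1 - expected)


# Made-up test-set labels (1 = symptom recorded) and model probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

model_auc = auc(labels, scores)
model_kappa = cohen_kappa(labels, predictions)
```

The AUC uses the raw scores and is therefore threshold-free, whereas kappa and accuracy depend on the chosen cut-off, which is one reason a study may report them side by side.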

    Short Messages Spam Filtering Combining Personality Recognition and Sentiment Analysis

    Short communication channels are currently growing because of the huge increase in the number of smartphone and online social network users. This growth attracts malicious campaigns, such as spam campaigns, that are a direct threat to the security and privacy of users. While most research focuses on automatic text classification, in this work we demonstrate that current short-message spam detection systems can be improved using a novel method. We combine personality recognition and sentiment analysis techniques to analyze Short Message Service (SMS) texts. We enrich a publicly available dataset by adding these features to each message, first separately and then in combination, creating new datasets. We apply several combinations of the best SMS spam classifiers and filters to each dataset in order to compare their results. Based on the experimental results, we analyze the real influence of each feature and of the combination of both. In the end, the best results are improved in terms of accuracy, reaching 99.01%, and the number of false positives is reduced.

    A study of the personalization of spam content using Facebook public information

    Millions of users per day are affected by unsolicited email campaigns. Spam filters are capable of detecting and blocking an increasing number of messages, but researchers have quantified a response rate of 0.006% [1], which is still enough to turn a considerable profit when sending millions of emails, as spammers do. While current research addresses topics such as better spam filters or spam detection inside online social networks, in this paper we demonstrate that a classic spam model using online social network information can achieve a click-through rate of 7.62%. We collect email addresses from the Internet, complete the email owners' information using their public social network profile data, and analyze the response to personalized spam sent to users according to their profile, using a fake website. Finally, we demonstrate the effectiveness of these profile-based emails at circumventing spam detection, and we compare results between typical spam and personalized spam.

    Deobfuscating Leetspeak With Deep Learning to Improve Spam Filtering

    The evolution of anti-spam filters has forced spammers to make greater efforts to bypass filters in order to distribute content over networks. The distribution of content encoded in images or the use of Leetspeak are concrete and clear examples of techniques currently used to bypass filters. Despite the importance of dealing with these problems, the number of studies to solve them is quite small, and the reported performance is very limited. This study reviews the work done so far (very rudimentary) for Leetspeak deobfuscation and proposes a new technique based on using neural networks for decoding purposes. In addition, we distribute an image database specifically created for training Leetspeak decoding models. We have also created and made available four different corpora to analyse the performance of Leetspeak decoding schemes. Using these corpora, we have experimentally evaluated our neural network approach for decoding Leetspeak. The results obtained have shown the usefulness of the proposed model for addressing the deobfuscation of Leetspeak character sequences
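For context, the most rudimentary kind of Leetspeak decoding is a fixed character-substitution table, sketched below. This is not the neural-network technique the study proposes: `LEET_MAP` and `naive_deleet` are hypothetical names, and the table is a small assumed sample of common substitutions.

```python
# Hypothetical substitution table; real Leetspeak is ambiguous
# (for example "1" may stand for "i" or "l").
LEET_MAP = {
    "4": "a", "@": "a", "3": "e", "1": "i",
    "0": "o", "5": "s", "$": "s", "7": "t",
}
_TABLE = str.maketrans(LEET_MAP)


def naive_deleet(text):
    """Lowercase the text and replace each known Leetspeak character with one fixed letter."""
    return text.lower().translate(_TABLE)


decoded = naive_deleet("V14GR4")      # every substitution is unambiguous here
ambiguous = naive_deleet("7o7a1")     # intended "total", but "1" is mapped to "i"
```

The second call shows the limitation of a one-to-one table: a symbol that can stand for several letters is always decoded the same way, and genuine digits in legitimate text are rewritten too. A learned decoder can, in principle, use context to resolve such cases.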

    Multi-objective evolutionary optimization for dimensionality reduction of texts represented by synsets

    Despite new developments in machine learning classification techniques, improving the accuracy of spam filtering is a difficult task due to linguistic phenomena that limit its effectiveness. In particular, we highlight polysemy, synonymy, the usage of hypernyms/hyponyms, and the presence of irrelevant/confusing words. These problems should be solved at the pre-processing stage to avoid using inconsistent information in the building of classification models. Previous studies have suggested that synset-based representation strategies could be successfully used to solve synonymy and polysemy problems. Complementarily, it is possible to take advantage of hyponymy/hypernymy relations to implement dimensionality reduction strategies. These strategies could unify textual terms to model the intentions of the document without losing any information (e.g., bringing together the synsets “viagra”, “ciallis”, “levitra” and others representing similar drugs by using “virility drug”, which is a hypernym of all of them). These feature reduction schemes are known as lossless strategies, as the information is not removed but only generalised. However, in some types of text classification problems (such as spam filtering) it may not be worthwhile to keep all the information, and dimensionality reduction algorithms can instead be allowed to discard information that may be irrelevant or confusing. In this work, we introduce feature reduction as a multi-objective optimisation problem to be solved using a Multi-Objective Evolutionary Algorithm (MOEA). With minor modifications, our algorithm can implement lossless (using only semantic-based synset grouping), low-loss (discarding irrelevant information and using semantic-based synset grouping) or lossy (discarding only irrelevant information) strategies. 
The contribution of this study is two-fold: (i) to introduce different dimensionality reduction methods (lossless, low-loss and lossy) as an optimisation problem that can be solved using MOEA and (ii) to provide an experimental comparison of lossless and low-loss schemes for text representation. The results obtained support the usefulness of the low-loss method to improve the efficiency of classifiers.
Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R
Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-2-R
Xunta de Galicia | Ref. ED431C 2022/03-GRC
Eusko Jaurlaritza | Ref. IT1676-22
Fundação para a Ciência e a Tecnologia | Ref. UIDB/04466/2020
Fundação para a Ciência e a Tecnologia | Ref. UIDP/04466/202
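The core idea behind treating feature reduction as a multi-objective problem is that candidate solutions are compared by Pareto dominance rather than by a single score. The sketch below illustrates only that comparison; the two objectives (feature count and error rate, both minimised) and the candidate values are hypothetical assumptions, and it is not the authors' evolutionary algorithm.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (feature_count, error_rate) pairs, both to be minimised.
solutions = [(1000, 0.05), (400, 0.06), (400, 0.09), (150, 0.12), (900, 0.07)]
front = pareto_front(solutions)
```

Here `(400, 0.09)` and `(900, 0.07)` are dropped because `(400, 0.06)` is at least as good on both objectives. The remaining three candidates represent different trade-offs between a compact representation and classification error, and choosing among them is left to the practitioner.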

    Visualization of Misuse-Based Intrusion Detection: Application to Honeynet Data

    This study presents a novel soft computing system that provides network managers with a synthetic and intuitive representation of the state of the monitored network, in order to reduce the widely known high false-positive rate associated with misuse-based Intrusion Detection Systems (IDSs). The proposed system is based on the use of different projection methods for the visual inspection of honeypot data, and may be seen as a complementary network security tool that sheds light on internal data structures through visual inspection. Furthermore, it is intended to help understand the performance of Snort (a well-known misuse-based IDS) through the visualization of attack patterns. Empirical verification and comparison of the proposed projection methods are performed in a real domain where real-life data are defined and analyzed.